{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with large data using datashader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The various plotting backends supported by HoloViews (such as Matplotlib and Bokeh) each have  limitations on the amount of data that is practical to work with, for a variety of reasons.  For instance, Bokeh mirrors your data directly into an HTML page viewable in your browser, which can cause problems when data sizes approach the limited memory available for each web page in your browser.\n",
    "\n",
    "Luckily, a visualization of even the largest dataset will be constrained by the resolution of your display device, and so one approach to handling such data is to pre-render or rasterize the data into a fixed-size array or image before sending it to the backend.  The [Datashader package](https://github.com/bokeh/datashader) provides a high-performance big-data rasterization pipeline that works seamlessly with HoloViews to support datasets that are orders of magnitude larger than those supported natively by the plotting backends."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import holoviews as hv\n",
    "import datashader as ds\n",
    "from holoviews.operation.datashader import aggregate, shade, datashade, dynspread\n",
    "from holoviews.operation import decimate\n",
    "hv.extension('bokeh')\n",
    "decimate.max_samples=1000\n",
    "dynspread.max_px=20\n",
    "dynspread.threshold=0.5\n",
    "\n",
    "def random_walk(n, f=5000):\n",
    "    \"\"\"Random walk in a 2D space, smoothed with a filter of length f\"\"\"\n",
    "    xs = np.convolve(np.random.normal(0, 0.1, size=n), np.ones(f)/f).cumsum()\n",
    "    ys = np.convolve(np.random.normal(0, 0.1, size=n), np.ones(f)/f).cumsum()\n",
    "    xs += 0.1*np.sin(0.1*np.array(range(n-1+f))) # add wobble on x axis\n",
    "    xs += np.random.normal(0, 0.005, size=n-1+f) # add measurement noise\n",
    "    ys += np.random.normal(0, 0.005, size=n-1+f)\n",
    "    return np.column_stack([xs, ys])\n",
    "\n",
    "def random_cov():\n",
    "    \"\"\"Random covariance for use in generating 2D Gaussian distributions\"\"\"\n",
    "    A = np.random.randn(2,2)\n",
    "    return np.dot(A, A.T)\n",
    "\n",
    "def time_series(T = 1, N = 100, mu = 0.1, sigma = 0.1, S0 = 20):  \n",
    "    \"\"\"Parameterized noisy time series\"\"\"\n",
    "    dt = float(T)/N\n",
    "    t = np.linspace(0, T, N)\n",
    "    W = np.random.standard_normal(size = N) \n",
    "    W = np.cumsum(W)*np.sqrt(dt) # standard brownian motion\n",
    "    X = (mu-0.5*sigma**2)*t + sigma*W \n",
    "    S = S0*np.exp(X) # geometric brownian motion\n",
    "    return S"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Supported ``Elements``\n",
    "\n",
    "As HoloViews elements are fundamentally data containers, not visualizations, you can very quickly declare elements such as ``Points`` or ``Path`` containing datasets that may be as large as the available memory. For instance, this means you can immediately specify a datastructure that you can work with until you try to visualize it directly with either the matplotlib or bokeh plotting extensions as the rendering process may now be prohibitively expensive.\n",
    "\n",
    "Let's start with a simple example we can visualize as normal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(1)\n",
    "points = hv.Points(np.random.multivariate_normal((0,0), [[0.1, 0.1], [0.1, 1.0]], (1000,)),label=\"Points\")\n",
    "paths = hv.Path([random_walk(2000,30)], label=\"Paths\")\n",
    "\n",
    "points + paths"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These browser-based plots are fully interactive, as you can see if you select the Wheel Zoom or Box Zoom tools and use your scroll wheel or click and drag.  \n",
    "\n",
    "Because all of the data in these plots gets transferred directly into the web browser, the interactive functionality will be available even on a static export of this figure as a web page. Note that even though the visualization above is not computationally expensive, even with just 1000 points as in the scatterplot above, the plot already suffers from [overplotting](https://anaconda.org/jbednar/plotting_pitfalls), with later points obscuring previously plotted points.  With much larger datasets, these issues will quickly make it impossible to see the true structure of the data.  \n",
    "\n",
    "\n",
    "# Datashader operations\n",
    "\n",
    "If we tried to visualize the same two elements below which are just larger versions of the same data above, the plots would be nearly unusable even if the browser did not crash:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(1)\n",
    "points = hv.Points(np.random.multivariate_normal((0,0), [[0.1, 0.1], [0.1, 1.0]], (1000000,)),label=\"Points\")\n",
    "paths = hv.Path([0.15*random_walk(100000) for i in range(10)],label=\"Paths\")\n",
    "\n",
    "#points + paths  ## Danger! Browsers can't handle 1 million points!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Luckily, because elements are just containers for data and associated metadata, not plots, HoloViews can generate entirely different types of visualizations from the same data structure when appropriate.  For instance, in the plot on the left below you can see the result of applying a `decimate()` operation acting on the `points` object, which will automatically downsample this million-point dataset to at most 1000 points at any time as you zoom in or out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "decimate(points) + datashade(points) + datashade(paths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Decimating a plot in this way can be useful, but it discards most of the data, yet still suffers from overplotting. If you have Datashader installed, you can instead use the `datashade()` operation to create a dynamic Datashader-based Bokeh plot.  (Here `datashade()` is just a convenient shortcut for the two main steps used in datashader, i.e., `shade(aggregate())`, which can also be invoked separately.) The middle plot above shows the result of using `datashade()` to create a dynamic Datashader-based plot out of an Element with arbitrarily large data.  In the Datashader version, a new image is regenerated automatically on every zoom or pan event, revealing all the data available at that zoom level and avoiding issues with overplotting.\n",
    "\n",
    "These two Datashader-based plots are similar to the native Bokeh plots above, but instead of making a static Bokeh plot that embeds points or line segments directly into the browser, HoloViews sets up a Bokeh plot with dynamic callbacks that render the data as an RGB image using Datashader instead.  The dynamic re-rendering provides an interactive user experience even though the data itself is never provided directly to the browser.  Of course, because the full data is not in the browser, a static export of this page (e.g. on anaconda.org) will only show the initially rendered version, and will not update with new images when zooming as it will when there is a live Python process available.\n",
    "\n",
    "Though you can no longer have a completely interactive exported file, with the Datashader version on a live server you can now change the number of data points from 1000000 to 10000000 or more to see how well your machine will handle larger datasets. It will get a bit slower, but if you have enough memory, it should still be very usable, and should never crash your browser as transferring the whole dataset into your browser would.  If you don't have enough memory, you can instead set up a [Dask](http://dask.pydata.org) dataframe as shown in other Datashader examples, which will provide out of core and/or distributed processing to handle even the largest datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Spreading\n",
    "\n",
    "\n",
    "\n",
    "The Datashader examples above treat points and lines as infinitesimal in width, such that a given point or small bit of line segment appears in at most one pixel. This approach ensures that the overall distribution of the points will be mathematically well founded -- each pixel will scale in value directly by the number of points that fall into it, or by the lines that cross it.\n",
    "\n",
    "However, many monitors are sufficiently high resolution that the resulting point or line can be difficult to see---a single pixel may not actually be visible on its own, and its color may likely be very difficult to make out.  To compensate for this, HoloViews provides access to Datashader's image-based \"spreading\", which makes isolated pixels \"spread\" into adjacent ones for visibility.  Because the amount of spreading that's useful depends on how close the datapoints are to each other on screen, the most useful such function is `dynspread`, which spreads up to a maximum size as long as it does not exceed a specified fraction of adjacency between pixels.  You can compare the results in the two plots below after zooming in:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "datashade(points) + dynspread(datashade(points))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both plots show the same data, and look identical when zoomed out, but when zoomed in enough you should be able to see the individual data points on the right while the ones on the left are barely visible.  The dynspread parameters typically need some hand tuning, as the only purpose of such spreading is to make things visible on a particular monitor for a particular observer; the underlying mathematical operations in Datashader do not normally need parameters to be adjusted.\n",
    "\n",
    "The same operation works similarly for line segments:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "datashade(paths) + dynspread(datashade(paths))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Multidimensional plots\n",
    "\n",
    "The above plots show two dimensions of data plotted along *x* and *y*, but Datashader operations can be used with additional dimensions as well.  For instance, an extra dimension (here called `k`), can be treated as a category label and used to colorize the points or lines.  Compared to a standard scatterplot that would suffer from overplotting, here the result will be merged mathematically by Datashader, completely avoiding any overplotting issues except local ones due to spreading:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%opts RGB [width=400] {+axiswise}\n",
    "\n",
    "np.random.seed(3)\n",
    "kdims=['d1','d2']\n",
    "num_ks=8\n",
    "\n",
    "def rand_gauss2d():\n",
    "    return 100*np.random.multivariate_normal(np.random.randn(2), random_cov(), (100000,))\n",
    "\n",
    "gaussians = {i: hv.Points(rand_gauss2d(), kdims=kdims) for i in range(num_ks)}\n",
    "lines = {i: hv.Curve(time_series(N=10000, S0=200+np.random.rand())) for i in range(num_ks)}\n",
    "\n",
    "gaussspread = dynspread(datashade(hv.NdOverlay(gaussians, kdims=['k']), aggregator=ds.count_cat('k')))\n",
    "linespread  = dynspread(datashade(hv.NdOverlay(lines,     kdims=['k']), aggregator=ds.count_cat('k')))\n",
    "\n",
    "gaussspread + linespread"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because Bokeh only ever sees an image, providing legends and keys has to be done separately, though we are working to make this process more seamless.  For now, you can show a legend by adding a suitable collection of labeled points:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%opts RGB [width=600]\n",
    "\n",
    "from datashader.colors import Sets1to3 # default datashade() and shade() color cycle\n",
    "color_key = list(enumerate(Sets1to3[0:num_ks]))\n",
    "color_points = hv.NdOverlay({k: hv.Points([0,0], label=str(k)).opts(style=dict(color=v)) for k, v in color_key})\n",
    "\n",
    "color_points * gaussspread"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `hv.NdOverlay` data structure merges all values along that dimension into the same image, (optionally) coloring each point or line to keep the values visibly different.  If you prefer to keep the values completely separate so that every image contains only one value along the `k` dimension, you can put the data into an `hv.HoloMap`, which lets you use indexes to choose a value for that dimension (or for multiple such dimensions).  If you have not indexed into the dimension(s) to choose one location in the multidimensional space, HoloViews will automatically generate slider widgets to allow you to choose such a value interactively:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%opts RGB [width=300] {+axiswise}\n",
    "\n",
    "datashade(hv.HoloMap(gaussians, kdims=['k'])) + datashade(hv.HoloMap(lines, kdims=['k']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can thus very naturally explore even very large multidimensional datasets.  Note that the static exported version (e.g. on anaconda.org) will only show a single frame, rather than the entire set of frames visible with a live Python server.\n",
    "\n",
    "\n",
    "## Working with time series\n",
    "\n",
    "Although Datashader does not natively [support datetime(64)](https://github.com/bokeh/datashader/issues/270) types for its dimensions, we can convert to an integer representation and apply a custom formatter using HoloViews and Bokeh:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bokeh.models import DatetimeTickFormatter\n",
    "def apply_formatter(plot, element):\n",
    "    plot.handles['xaxis'].formatter = DatetimeTickFormatter()\n",
    "    \n",
    "import pandas as pd\n",
    "drange = pd.date_range(start=\"2014-01-01\", end=\"2016-01-01\", freq='1D') # or '1min'\n",
    "dates = drange.values.astype('int64')/10**6 # Convert dates to ints"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%opts RGB [finalize_hooks=[apply_formatter] width=800]\n",
    "\n",
    "curve = hv.Curve((dates, time_series(N=len(dates), sigma = 1)))\n",
    "datashade(curve, cmap=[\"blue\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "HoloViews also supplies some operations that are useful in combination with Datashader timeseries.  For instance, you can compute a rolling mean of the results and then show a subset of outlier points, which will then support hover, selection, and other interactive Bokeh features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%opts Overlay [finalize_hooks=[apply_formatter] width=800] \n",
    "%%opts Scatter [tools=['hover', 'box_select']] (line_color=\"black\" fill_color=\"red\" size=10)\n",
    "from holoviews.operation.timeseries import rolling, rolling_outlier_std\n",
    "smoothed = rolling(curve, rolling_window=50)\n",
    "outliers = rolling_outlier_std(curve, rolling_window=50, sigma=2)\n",
    "\n",
    "datashade(curve, cmap=[\"blue\"]) * dynspread(datashade(smoothed, cmap=[\"red\"]),max_px=1) * outliers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rolling.function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the above plot will look blocky in a static export (such as on anaconda.org), because the exported version is generated without taking the size of the actual plot (using default height and width for Datashader) into account, whereas the live notebook automatically regenerates the plot to match the visible area on the page.\n",
    "\n",
    "# Hover info\n",
    "\n",
    "As you can see, converting the data to an image using Datashader makes it feasible to work with even very large datasets interactively.  One unfortunate side effect is that the original datapoints and line segments can no longer be used to support \"tooltips\" or \"hover\" information directly; that data simply is not present at the browser level, and so the browser cannot unambiguously report information about any specific datapoint. Luckily, you can still provide hover information that reports properties of a subset of the data in a separate layer (as above), or you can provide information for a spatial region of the plot rather than for specific datapoints.  For instance, in some small rectangle you can provide statistics such as the mean, count, standard deviation, etc:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%opts QuadMesh [tools=['hover']] (alpha=0 hover_alpha=0.2)\n",
    "from holoviews.streams import RangeXY\n",
    "\n",
    "fixed_hover = datashade(points, width=400, height=400) * \\\n",
    "    hv.QuadMesh(aggregate(points, width=10, height=10, dynamic=False))\n",
    "\n",
    "dynamic_hover = datashade(points, width=400, height=400) * \\\n",
    "    hv.util.Dynamic(aggregate(points, width=10, height=10, streams=[RangeXY]), operation=hv.QuadMesh)\n",
    "\n",
    "fixed_hover + dynamic_hover"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the above examples, the plot on the left provides hover information at a fixed spatial scale, while the one on the right reports on an area that scales with the zoom level so that arbitrarily small regions of data space can be examined, which is generally more useful.\n",
    "\n",
    "As you can see, HoloViews exposes the functionality of Datashader, connecting this powerful rasterizer with Bokeh output so you can visualize data of any size in a browser using just a few lines of code. Because Datashader-based HoloViews plots are implemented as HoloViews operations, they support all of the same features as regular HoloViews objects, and can freely be laid out, overlaid, and nested together."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
